Hi, I'm Andreas!

• Today's topic: the Jaguar CPU architecture 

• Microarchitecture matters! 

• Code doesn't run in a vacuum 

• Low-level knowledge improves high-level designs 

• x86 doesn't mean "stop caring" 
The Jaguar: What we already know

• Well rounded 

• Not many crazy pitfalls 

• Out of order execution 

• Sounds easy! 
Anonymous programmers on OOO

• "Memory access isn't a problem with OOO"

• "Branches aren't a problem with OOO"

• "SIMD isn't necessary on OOO"

• What's true? False?

• We (optimizers) need OOO intuition, badly



Disclaimers 


• This talk contains micro-optimization material 

• Don't start here and expect results 

• Take a deep breath 

• Step back, consider the whole problem 

• Re-organize data before resorting to micro-opts 

• Micro-optimize only where it makes sense 

• Special sauce for special circumstances 
OOO and instruction fetching

• A Jaguar core is always* fetching instructions

• It can decode up to 2 macro-ops / cycle

• Most instructions decode to one macro-op

• AVX instructions most notably take 2

• Macro-ops split into micro-ops

* Assuming space in all relevant buffers

* Assuming no I1 or ITLB misses
Macro vs micro ops

add eax, ebx  => 1 macro-op, 1 micro-op
add eax, [m]  => 1 macro-op, 2 micro-ops
add [m], eax  => 1 macro-op, 2 (!) micro-ops


But where does it fetch from? 


• Branch prediction and OOO are intertwined 

• If the core doesn't know for sure, it'll guess 

• This is called speculative execution 

• If it guesses incorrectly, things suck 

• Fortunately, it's a pretty good guesser* 


* Assuming programmer doesn't ignore microarchitecture 
Will it execute speculatively into

• ...branches? 

• Yes 

• ...direct function calls? 

• Yes 

• ...indirect (virtual/pointer) function calls? 

• Yes, scarily 
Getting fetching wrong

• Illusion of correctness is maintained, of course 

• But the errant instructions will affect the cache 

• Loads and line reservations for writes 

• No way to "undo" - visible on other cores too 

• Especially bad for "branchy" data structures 

• Tree-like data with pointers 
Branchy data structure example

struct Node
{
    Node *left;
    Node *right;
    BigData bigData;
};

GDC Game Developers Conference, March 14-18, 2016 · Expo: March 16-18, 2016 · #GDC16



Branchy data structure example 

void DoSomethingExpensiveToNodes(Node* n, int f)
{
    int decide = SomehowDecideChild(n, f); // high latency

    // Misprediction central = bad guesses galore
    if (decide < 0) {
        DoSomethingExpensiveToNodes(n->left, f);
    } else if (decide > 0) {
        DoSomethingExpensiveToNodes(n->right, f);
    } else {
        // Do something expensive to n->bigData
    }
}
Retiring

• All instructions retire (commit) in program order

• That is, their effects are visible from outside the core

• Retirement happens at a max rate of 2/cycle

• They can also be killed instead of retired

• For example due to branch mispredictions as we saw
The Battle of North Bridge

• D1 hit

• L2 hit: >= 25 c

• Memory: at least 200 c

L2 misses, still a thing? 

• A load from RAM will not retire for 200+ cycles 

• So what? 

• OOO can reorder around long latencies, right? 

• Sure, but: Always Be Fetching 

• The frontend issues 2 instructions / cycle.. 
L2 misses & RCU tag team #fail

• L2 miss followed by low-latency instructions 

• Cache hits, simple/vector ALU etc etc 

• Remember, 2 per cycle! 

• RCU fills up in < 32 cycles, and we're wedged 

• In practice less, because macro ops != instructions 

• Result: ~150+ cycles wasted stalling 

• Only the L2 miss retiring will free up RCU space 


Micro-optimizing L2 misses 

• Re-schedule instructions 

• Move independent instructions with long latencies to 
right after a load that is likely to miss 

• The longer the latencies, the more it softens the blow 

• Square roots, divides and reciprocals 

• And other (independent) loads! 
Poor load organization

void MyRoutine(A* ap, B* bp)
{
    float a = ap->A; // L2 miss
    < prep work >    // RCU stall risk

    float b = bp->B; // L2 miss!
    < prep work >    // Moar RCU stall

    < rest of routine >
}
Better load organization

void MyRoutineAlt(A* ap, B* bp)
{
    float a = ap->A; // L2 miss
    float b = bp->B; // L2 miss (probably "free")

    < prep work >
    < prep work >

    < rest of routine >
}
L2 misses on Jaguar in practice

• OOO doesn't fundamentally solve RAM latency

• The window is way too small

• Making it bigger has other problems

• Try to issue loads together to overlap misses

• Hedging our bets in case there is more than one miss

• Can overlap up to 8 L2 misses on a single core

• Key improvement over in-order execution (IOE), with some effort
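One way to picture "issue loads together" is to keep several independent pointer chains in flight at once. The sketch below is only an illustration under assumptions: the `Item` type, the function name, and the choice of four lists are all hypothetical and not from the talk. Each pass of the inner loop touches one node from every list, so the loads do not depend on each other and their misses have a chance to overlap.

```c
#include <stddef.h>

/* Hypothetical list node, used only for this illustration. */
typedef struct Item {
    struct Item *next;
    int value;
} Item;

/* Sums `value` over four independent lists, advancing them in lockstep.
   Every pass issues up to four loads that do not depend on one another,
   so their cache misses can be outstanding at the same time. */
long sum_four_lists(Item *heads[4])
{
    Item *cur[4] = { heads[0], heads[1], heads[2], heads[3] };
    long total = 0;
    int live = 0;

    /* Count the lists that still have nodes left. */
    for (int i = 0; i < 4; ++i)
        live += (cur[i] != NULL);

    while (live > 0) {
        for (int i = 0; i < 4; ++i) {
            Item *p = cur[i];
            if (p != NULL) {
                total += p->value;
                cur[i] = p->next;
                if (cur[i] == NULL)
                    --live;   /* this list just ran out */
            }
        }
    }
    return total;
}
```

Whether this wins anything on real data has to be measured; it only helps when the lists are truly independent and the nodes are not already in cache.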


Warming up: Unrolling 

• Classical optimization technique 

• Idea: reduce loop management overhead 

• Very important on PS3/X360 in-order CPUs 

• Heavily employed by compilers on x86 too 

• Clang *loves* unrolling, as we'll see 

• Let's add some integers from an array 

• Does unrolling help? 
Simple Unrolling, Scalar Base Version

.loop:
        add     eax, [rdi]
        lea     rdi, [rdi + 4]
        dec     esi
        jnz     .loop
Simple Unrolling, Scalar 2x unroll

.loop:
        add     eax, [rdi + 0]
        add     eax, [rdi + 4]
        lea     rdi, [rdi + 8]
        dec     esi
        jnz     .loop
Simple Unrolling, Scalar 4x unroll

.loop:
        add     eax, [rdi + 0]
        add     eax, [rdi + 4]
        add     eax, [rdi + 8]
        add     eax, [rdi + 12]
        lea     rdi, [rdi + 16]
        dec     esi
        jnz     .loop
Simple Unrolling, Scalar 8x unroll

.loop:
        add     eax, [rdi + 0]
        add     eax, [rdi + 4]
        add     eax, [rdi + 8]
        add     eax, [rdi + 12]
        add     eax, [rdi + 16]
        add     eax, [rdi + 20]
        add     eax, [rdi + 24]
        add     eax, [rdi + 28]
        lea     rdi, [rdi + 32]
        dec     esi
        jnz     .loop
Scalar Unroll Performance, 1024 elems

Scalar Loop Performance Analysis

• You get to talk to cache once per cycle 

• (Once for reading, once for writing) 

• Each add needs one cache transaction to read 

• 1024 x 32-bit reads

• Each will have 3 cycles D1$ latency 

• Fully overlaps to 1026 cycles of pure cache latency 

• 1026 = best possible latency this loop can ever have 


Scalar bottleneck 


• The base loop bottlenecks on frontend 

• 2 instructions/cycle means 50% ALU utilization 

• We have 4 instructions in the loop, only one add 

• As we unroll we shift the bottleneck to load unit 

• Getting closer and closer to the 1026 best case 

• At 3x unroll we have an ideal steady state 

• Any more is a waste of .text bytes 
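The arithmetic behind these two slides can be written down as a tiny model. This is a rough sketch, not a measurement: it assumes every instruction in the loop is one macro-op, that the frontend sustains exactly 2 per cycle, that the load unit sustains exactly 1 load per cycle, and that branches predict perfectly. The function names are made up for the example.

```c
#include <stdint.h>

/* Pure cache latency: with one load issued per cycle, the last of n loads
   issues on cycle n - 1 and its data arrives `latency` cycles later. */
uint32_t best_case_load_cycles(uint32_t n, uint32_t latency)
{
    return (n - 1) + latency;
}

/* Steady-state estimate for the unrolled scalar add loop.
   Per iteration: `unroll` adds + lea + dec + jnz through a 2-wide frontend,
   and `unroll` loads through a 1-per-cycle load unit. The slower of the two
   is the bottleneck. Values are kept doubled to stay in integers.
   Any remainder of n / unroll is ignored. */
uint32_t steady_state_cycles(uint32_t n, uint32_t unroll)
{
    uint32_t frontend_x2 = unroll + 3;   /* (unroll + 3) / 2 cycles, doubled */
    uint32_t loads_x2 = 2 * unroll;      /* unroll cycles, doubled */
    uint32_t per_iter_x2 = frontend_x2 > loads_x2 ? frontend_x2 : loads_x2;
    return (n / unroll) * per_iter_x2 / 2;
}
```

Under these assumptions, 1024 elements come out at 2048 cycles with no unrolling, 1280 at 2x, and 1024 at 4x and beyond. At 3x the frontend and the load unit both need 3 cycles per iteration, which is the balance point the slide refers to.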

What about SIMD?

.loop:
        vpaddd  xmm0, xmm0, [rdi]
        ; unroll more vpaddd here
        lea     rdi, [rdi + 0x10]
        dec     rsi
        jnz     .loop
Results

SIMD analysis

• Uses full 128 bit/cycle D1$ bandwidth 

• 4x improvement over scalar code 

• Not primarily because the adds are parallel! 

• Unrolling helps the same way as for scalar case 

• Shifts emphasis from FE to LS 


Unrolling Takeaways 

• Unrolling can help very simple loops 

• By shifting emphasis from frontend to other ports 

• Frontend is relatively weak at 2 insns/cycle 

• The cache can deliver 128 bits per cycle 

• Scalar code uses only a fraction of that bandwidth 

• SIMD code has a natural edge scalar can't touch 
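For readers who prefer intrinsics to assembly, the SIMD loop can be sketched roughly as below. This is an assumption-laden illustration, not the code that was benchmarked: the function name, the 2x unroll, and the use of unaligned loads are choices made for the example. It uses only SSE2 intrinsics; whether the compiler emits the VEX-encoded `vpaddd` form depends on build flags.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Sums `count` 32-bit values, 8 per iteration (two 128-bit loads),
   using two independent accumulators. */
uint32_t sum_u32_sse2(const uint32_t *nums, size_t count)
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    size_t i = 0;

    for (; i + 8 <= count; i += 8) {
        acc0 = _mm_add_epi32(acc0, _mm_loadu_si128((const __m128i *)(nums + i)));
        acc1 = _mm_add_epi32(acc1, _mm_loadu_si128((const __m128i *)(nums + i + 4)));
    }

    /* Horizontal reduction of the four 32-bit lanes. */
    __m128i acc = _mm_add_epi32(acc0, acc1);
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(acc);

    /* Scalar tail for the last count % 8 elements. */
    for (; i < count; ++i)
        sum += nums[i];
    return sum;
}
```

Each 128-bit load moves four elements through the load unit in one transaction, which is where the bandwidth advantage over the scalar loop comes from.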
What if we do it in C?

uint32_t UnrollTestC(const uint32_t* nums, size_t count)
{
    uint32_t sum = 0;
    while (count--)
    {
        sum += *nums++;
    }
    return sum;
}
Meet clang, unroller extraordinaire

UnrollTestC(unsigned int const*, unsigned long):
        xor      eax, eax
        test     rsi, rsi
        je       00000000000000ABh
        mov      rax, rsi
        and      rax, 0FFFFFFFFFFFFFFF0h
        mov      r8, rsi
        vpxor    xmm0, xmm0, xmm0
        mov      rcx, rsi
        and      r8, 0FFFFFFFFFFFFFFF0h
        je       000000000000005Fh
        sub      rcx, rax
        lea      rdx, [rdi+r8*4]
        add      rdi, 30h
        vpxor    xmm0, xmm0, xmm0
        mov      rax, r8
        vpxor    xmm1, xmm1, xmm1
        vpxor    xmm2, xmm2, xmm2
        vpxor    xmm3, xmm3, xmm3
        vpaddd   xmm0, xmm0, xmmword ptr [rdi-30h]
        vpaddd   xmm1, xmm1, xmmword ptr [rdi-20h]
        vpaddd   xmm2, xmm2, xmmword ptr [rdi-10h]
        vpaddd   xmm3, xmm3, xmmword ptr [rdi]
        add      rdi, 40h
        add      rax, 0FFFFFFFFFFFFFFF0h
        jne      0000000000000040h
        jmp      0000000000000071h
        mov      rdx, rdi
        xor      r8d, r8d
        vpxor    xmm1, xmm1, xmm1
        vpxor    xmm2, xmm2, xmm2
        vpxor    xmm3, xmm3, xmm3
        vpaddd   xmm0, xmm1, xmm0
        vpaddd   xmm0, xmm2, xmm0
        vpaddd   xmm0, xmm3, xmm0
        vmovhlps xmm1, xmm0, xmm0
        vpaddd   xmm0, xmm0, xmm1
        vphaddd  xmm0, xmm0, xmm0
        vmovd    eax, xmm0
        cmp      r8, rsi
        je       00000000000000ABh
        nop      word ptr cs:[rax+rax+0]
        add      eax, dword ptr [rdx]
        add      rdx, 4
        dec      rcx
        jne      00000000000000A0h
        ret
Clang output analysis

• Clang unrolls to 4x SIMD 

• Achieves theoretical best case in this case 

• So compilers are great at this stuff, right? 

• Sometimes.. One sample is not enough. 
Unrolling in General

• Typically doesn't help more complicated loops

• Any added latency anywhere shifts the balance

• OOO is a hardware loop unroller!

• The hardware will run ahead into "future" iterations of the loop, issuing them speculatively

• Only if everything is in cache and all ops are simple will FE dominate the loop performance
Jaguar Unrolling Guidelines



• Turn to SIMD before you unroll scalar code 

• Use SSE (with VEX encoding), but not AVX 

• Unroll to gather data to run at full SIMD width 

• E.g. Unroll 32-bit fetching gather loop 4 times 

• Then process in 128-bit SIMD registers 
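The "gather, then process at full width" guideline might look something like the sketch below. The indexed-sum workload, the function name, and the parameter names are assumptions chosen for the illustration. The gather is unrolled four times so that each iteration fills one 128-bit register, and the actual work (here just an add) then runs on all four lanes at once.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Sums table[indices[i]] for i in [0, count). */
uint32_t sum_indexed_u32(const uint32_t *table, const uint32_t *indices, size_t count)
{
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;

    for (; i + 4 <= count; i += 4) {
        /* Four scalar 32-bit fetches: the unrolled gather. */
        uint32_t v0 = table[indices[i + 0]];
        uint32_t v1 = table[indices[i + 1]];
        uint32_t v2 = table[indices[i + 2]];
        uint32_t v3 = table[indices[i + 3]];

        /* Pack into one register and process at full SIMD width. */
        __m128i v = _mm_set_epi32((int)v3, (int)v2, (int)v1, (int)v0);
        acc = _mm_add_epi32(acc, v);
    }

    /* Reduce the four lanes, then handle the leftover elements. */
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    uint32_t sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    for (; i < count; ++i)
        sum += table[indices[i]];
    return sum;
}
```

With a workload this light the packing overhead probably eats the gain; the pattern pays off when the per-element processing after the gather is substantial.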


Prefetching 

• Required on PPC console era chips 

• Sprinkle in loops and reap benefits! 

• x86 also offers prefetch instructions 

• PREFETCHT0/1/2 - Vanilla prefetches

• PREFETCHNTA - Non-temporal prefetch 

• Use _mm_prefetch(addr, _MM_HINT_xxx) 

• So, should we use prefetches on Jaguar? 



Linked List Example 


struct GameObject {
    int m_Health;
    GameObject *m_Next;
    // other members...
};

int CountDeadObjects(GameObject* head)
{
    int dead_count = 0;
    while (head) {
        GameObject *next = head->m_Next;
        _mm_prefetch((const char *)next, _MM_HINT_T0);
        dead_count += head->m_Health == 0 ? 1 : 0;
        head = next;
    }
    return dead_count;
}
Linked List Asm, clang

CountDeadObjects:
        push    rbp
        mov     rbp, rsp
        xor     eax, eax
        test    rdi, rdi
        je      .done
        .align  4, 0x90
.loop:  mov     rcx, qword ptr [rdi + 8]
        prefetcht0 byte ptr [rcx]
        cmp     dword ptr [rdi], 1
        adc     eax, 0              ; dead_count update. Neat!
        test    rcx, rcx
        mov     rdi, rcx
        jne     .loop
.done:  pop     rbp
        ret



[Chart: Linked List Prefetching, Light Loop; cold vs. warm cache]

"We chased pointers, and I helped!"




Linked List Results 


• This type of prefetching is useless 

• No time for prefetch to actually help 

• Linked lists turn OOO into in-order 

• 100% bound by memory latency 

• Next pointer to fetch is hidden in memory 

• No way for CPU to run ahead and get data early 

• Also renders hardware array prefetchers useless 
Basic Array Example

• Consuming data linearly from RAM 

• No dependent pointers involved 

• Does prefetching help? 
Basic Array Example

struct Node {
    // data (24 bytes)
};

Node* base = ...;

for (size_t i = 0; i < count; ++i) {
    // Compute based on base[i]
}

[Chart; y-axis: cycles]
Light Array Workload Analysis

• Prefetching in the cold case doesn't help

• OOO does it better, more cheaply than we can

• Short loops will be running 4+ unrolls ahead

• Prefetching in the warm case actually hurts

• Adds useless ops for the FE to decode

• Adds load unit traffic that limits OOO "unrolling"

• Hardware figures this out itself without "help"
Heavy Array Workload

• Let's do some more number crunching 

• Enough that we're compute bound in theory 

• Does prefetching help in this case? 


[Chart; y-axis: cycles]

Basic Array Prefetching

• Don't waste time on this 

• Jaguar loves arrays 

• The CPU has dedicated prefetchers (Both D1$ + L2!) 

• OOO will execute ahead and issue loads too

• It's very hard to improve on basic array 
performance using prefetches 

• But you can definitely hurt it! 


Mixed Workload Example 

• Walking array elements with two pointers 

• struct Node { Secondary *p1, *p2; }

• Compute based on data fetched from both 

• Does prefetching help? 

• Light workload - a couple of ALU instructions 

• Heavy workload - 100s of cycles of ALU latency 


[Chart: Heavy workload, Pointer Chasing; y-axis: cycles]

Mixed Workload Results


• Prefetch can win when there is a lot of ALU 

• Preventing OOO scheduler from fetching ahead 

• Prefetching helps as in the "good old days" 

• In practice this isn't a super common setup 

• More bang for the buck to minimize pointers 


Jaguar Prefetching Guidelines 

• Never prefetch basic arrays 

• Actually hurts warm cache case with short loops 

• Prefetch only heavy array/pointer workloads 

• Need work to overlap the latency of the prefetch 

• Non-intuitive to reason about 

• Best to add close to gold when things are stable 

• Always measure, never assume! 
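As a rough picture of the one case where prefetching can pay off (array of nodes, each pointing at secondary data, with heavy work per element), consider the sketch below. Everything in it is an assumption for illustration: the `Secondary` layout, the stand-in `heavy_work` function, and the prefetch distance of 4, which would have to be tuned by measuring.

```c
#include <xmmintrin.h>  /* _mm_prefetch */
#include <stddef.h>

/* Hypothetical layouts, loosely following the mixed workload slide. */
typedef struct Secondary { float values[4]; } Secondary;
typedef struct Node { const Secondary *p1, *p2; } Node;

/* How many elements ahead to prefetch. A guess; tune by measuring. */
enum { PREFETCH_DISTANCE = 4 };

/* Stand-in for the heavy ALU work: a dependent chain of divides. */
static float heavy_work(const Secondary *a, const Secondary *b)
{
    float x = 1.0f;
    for (int k = 0; k < 4; ++k)
        x = x / (a->values[k] + b->values[k] + 1.0f) + 1.0f;
    return x;
}

float process_nodes(const Node *nodes, size_t count)
{
    float total = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        /* Request the secondary data of a later element now, so the
           current element's work overlaps the latency. */
        if (i + PREFETCH_DISTANCE < count) {
            _mm_prefetch((const char *)nodes[i + PREFETCH_DISTANCE].p1, _MM_HINT_T0);
            _mm_prefetch((const char *)nodes[i + PREFETCH_DISTANCE].p2, _MM_HINT_T0);
        }
        total += heavy_work(nodes[i].p1, nodes[i].p2);
    }
    return total;
}
```

The node array itself is left to the hardware prefetchers; only the pointed-to data, which the hardware cannot predict, gets explicit prefetches.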
Practical: Linear Searching

• How to best search an unsorted array? 

• Jaguar micro-optimization exercise 

• Assume everything is in D1 cache 

• Assume searching unsorted 32-bit numbers 

• Assume we just need found/not found result 

• Assume we expect to find something 99% of the time 

• Need to scan about half the array if early outing

The Naive Approach



bool ArraySearchNaiveC(uint32_t needle, const uint32_t haystack[], int count)
{
    for (int i = 0; i < count; ++i)
    {
        if (needle == haystack[i])
        {
            return true;
        }
    }
    return false;
}


clang output

ArraySearchNaiveC:
          xor     ecx, ecx
          mov     eax, 0
          test    edx, edx
          jle     .fail
          nop     dword ptr [rax+rax+0]
.loop:    mov     al, 1                         ; <- Wat
          cmp     dword ptr [rsi+rcx*4], edi
          je      .success
          inc     rcx
          cmp     ecx, edx
          jl      .loop
.fail:    xor     eax, eax
.success: ret
Naive performance

[Chart; x-axis: array size; series: Naive]
The naive approach, 1980s style

[Image: Rod Stewart album cover, "Out of Order"]

repne scasd

Isn't x86 something else?




Naive performance vs REPNE SCASD

[Chart; x-axis: array size; series: Naive, scasd]


Wat loses 


• That redundant mov cost clang the win 

• In the "naive" category 

• Loops this tight are extremely heavy on FE 

• Remember: max 2 decodes / cycle 

• Additional instructions cause significant perf drops 

• String instructions can easily be beat though 


PPC-optimized approach 

• We were using a remnant from our PS3 engine 

• Unroll cluster of 4 compares 

• Merge and branch once per cluster 

• Way better on PPU 

• How does it perform on Jaguar? 
FV32_Loop:
    v0 = list[0];
    v1 = list[1];
    v2 = list[2];
    v3 = list[3];
    list += 4;
    v0 = v0 ^ value;
    v1 = v1 ^ value;
    v2 = v2 ^ value;
    v3 = v3 ^ value;
    v0 = v0 | (-v0);
    v1 = v1 | (-v1);
    v2 = v2 | (-v2);
    v3 = v3 | (-v3);
    v0 = v0 & v1;
    v2 = v2 & v3;
    if ((v0 & v2) == 0) goto FV32_Found;
    if (list != loop_term) goto FV32_Loop;
PPC-optimized performance

[Chart; x-axis: array size; series: Naive, PPC]
PPC-optimized aftermath

• Old in-order optimizations not always clear wins

• Watch out for trading ALU for less branching

• Can remove OOO "unrolling" in tight loops

• Latency chains become longer in general

• The 4 cluster branching wins after ~32 elements

• Should be able to do better..
Let's search the whole array!

• Idea: Make it more predictable 

• Always the same work for a certain array size 

• Should be simpler to reason about? 
Whole Array Search

bool ArraySearchWholeArray(uint32_t needle, const uint32_t haystack[], int count)
{
    bool found = false;
    for (int i = 0; i < count; ++i)
    {
        found |= needle == haystack[i];
    }
    return found;
}



Whole Array Search Performance

[Chart: cycles vs. array size; series: Naive, Whole]

clang: "Let me unroll that for you..."

ArraySearchWholeArray(unsigned int, unsigned int const*, int):
        test        edx, edx
        jle         0000000000000171h
        lea         eax, [rdx-1]
        lea         r8, [rax+1]
        xor         ecx, ecx
        mov         r9, 1FFFFFFE0h
        vpxor       xmm0, xmm0, xmm0
        vxorps      xmm1, xmm1, xmm1
        vxorps      xmm2, xmm2, xmm2
        vxorps      xmm3, xmm3, xmm3
        and         r9, r8
        je          000000000000011Dh
        vmovd       xmm0, edi
        vpshufd     xmm0, xmm0, 0
        vinsertf128 ymm8, ymm0, xmm0, 1
        lea         rcx, [rsi+60h]
        inc         rax
        and         rax, 0FFFFFFFFFFFFFFE0h
        vpxor       xmm0, xmm0, xmm0
        vextractf128 xmm10, ymm8, 1
        vmovdqa     xmm11, xmmword ptr [...]
        vxorps      xmm1, xmm1, xmm1
        vxorps      xmm2, xmm2, xmm2
        vxorps      xmm3, xmm3, xmm3
        nop         dword ptr [rax+0]
        vpcmpeqd    xmm7, xmm10, xmmword ptr [rcx-50h]
        vpshufb     xmm7, xmm7, xmm11
        vpcmpeqd    xmm4, xmm8, xmmword ptr [rcx-60h]
        vpshufb     xmm4, xmm4, xmm11
        vmovlhps    xmm9, xmm4, xmm7
        vpcmpeqd    xmm7, xmm10, xmmword ptr [rcx-30h]
        vpshufb     xmm7, xmm7, xmm11
        vpcmpeqd    xmm4, xmm8, xmmword ptr [rcx-40h]
        vpshufb     xmm4, xmm4, xmm11
        vmovlhps    xmm4, xmm4, xmm7
        vpcmpeqd    xmm7, xmm10, xmmword ptr [rcx-10h]
        vpshufb     xmm7, xmm7, xmm11
        vpcmpeqd    xmm5, xmm8, xmmword ptr [rcx-20h]
        vpshufb     xmm5, xmm5, xmm11
        vmovlhps    xmm5, xmm5, xmm7
        vpcmpeqd    xmm7, xmm10, xmmword ptr [rcx+10h]
        vpshufb     xmm7, xmm7, xmm11
        vpcmpeqd    xmm6, xmm8, xmmword ptr [rcx]
        vpshufb     xmm6, xmm6, xmm11
        vmovlhps    xmm6, xmm6, xmm7
        vpor        xmm0, xmm0, xmm9
        vorps       xmm1, xmm1, xmm4
        vorps       xmm2, xmm2, xmm5
        vorps       xmm3, xmm3, xmm6
        sub         rcx, 0FFFFFFFFFFFFFF80h
        add         rax, 0FFFFFFFFFFFFFFE0h
        jne         00000000000000A0h
        mov         rcx, r9
        vorps       xmm0, xmm1, xmm0
        vorps       xmm0, xmm2, xmm0
        vorps       xmm0, xmm3, xmm0
        vmovhlps    xmm1, xmm0, xmm0
        vorps       xmm0, xmm0, xmm1
        vpshufd     xmm1, xmm0, 1
        vpor        xmm0, xmm0, xmm1
        vpalignr    xmm1, xmm0, xmm0, 2
        vpor        xmm0, xmm0, xmm1
        vpextrb     rax, xmm0, 0
        cmp         r8, rcx
        je          0000000000000173h
        lea         rsi, [rsi+rcx*4]
        sub         edx, ecx
        nop         word ptr cs:[rax+rax+0]
        cmp         dword ptr [rsi], edi
        sete        cl
        or          al, cl
        add         rsi, 4
        dec         edx
        jne         0000000000000160h
        jmp         0000000000000173h
        xor         eax, eax
        and         al, 1
        ret
So, compilers are great at this?

• Not always... 

• Highly variable performance in this version 

• Long scalar fixup loop at the end 

• We can easily do better ourselves 
Let's try that again

bool ArraySearchSimd(uint32_t needle, const uint32_t haystack[], int count)
{
    __m128i n = _mm_set1_epi32(needle);
    __m128i mask = _mm_setzero_si128();
    int aligned_count = count & ~3;
    int straggler_count = count & 3;
    int i;

    for (i = 0; i < aligned_count; i += 4) {
        __m128i val = _mm_loadu_si128((const __m128i*)(haystack + i));
        __m128i cmpmask = _mm_cmpeq_epi32(val, n);
        mask = _mm_or_si128(mask, cmpmask);
    }

    // Stragglers: reload the last 4 elements (assumes count >= 4)
    uint32_t straggler_mask_int = straggler_count ? ~0u << (4 - straggler_count) : 0;
    __m128i sm0 = _mm_cvtsi32_si128(straggler_mask_int);
    __m128i sm1 = _mm_unpacklo_epi8(sm0, sm0);
    __m128i sm2 = _mm_unpacklo_epi16(sm1, sm1);
    __m128i val = _mm_loadu_si128((const __m128i*)(haystack + count - 4));
    __m128i cmpmask = _mm_and_si128(_mm_cmpeq_epi32(val, n), sm2);
    mask = _mm_or_si128(mask, cmpmask);
    return _mm_movemask_ps(_mm_castsi128_ps(mask));
}
Let's try that again

ArraySearchSimd(unsigned int, unsigned int const*, int):
00000000000001C0  C5 F9 6E C1        vmovd       xmm0, edi
00000000000001C4  C5 F9 70 C0 00     vpshufd     xmm0, xmm0, 0
00000000000001C9  89 D1              mov         ecx, edx
00000000000001CB  83 E1 FC           and         ecx, 0FFFFFFFCh
00000000000001CE  C5 F1 EF C9        vpxor       xmm1, xmm1, xmm1
00000000000001D2  31 C0              xor         eax, eax
00000000000001D4  85 C9              test        ecx, ecx
00000000000001D6  7E 19              jle         00000000000001F1h
00000000000001D8  31 FF              xor         edi, edi
00000000000001DA  66 0F 1F 44 00 00  nop         word ptr [rax+rax+0]
00000000000001E0  C5 F9 76 14 BE     vpcmpeqd    xmm2, xmm0, xmmword ptr [rsi+rdi*4]
00000000000001E5  C5 F1 EB CA        vpor        xmm1, xmm1, xmm2
00000000000001E9  48 83 C7 04        add         rdi, 4
00000000000001ED  39 CF              cmp         edi, ecx
00000000000001EF  7C EF              jl          00000000000001E0h
00000000000001F1  89 D7              mov         edi, edx
00000000000001F3  83 E7 03           and         edi, 3
00000000000001F6  74 0E              je          0000000000000206h
00000000000001F8  B9 04 00 00 00     mov         ecx, 4
00000000000001FD  29 F9              sub         ecx, edi
00000000000001FF  B8 FF FF FF FF     mov         eax, 0FFFFFFFFh
0000000000000204  D3 E0              shl         eax, cl
0000000000000206  C5 F9 6E D0        vmovd       xmm2, eax
000000000000020A  C5 E9 60 D2        vpunpcklbw  xmm2, xmm2, xmm2
000000000000020E  C5 E9 61 D2        vpunpcklwd  xmm2, xmm2, xmm2
0000000000000212  48 63 C2           movsxd      rax, edx
0000000000000215  C5 F9 76 44 86 F0  vpcmpeqd    xmm0, xmm0, xmmword ptr [rsi+rax*4-10h]
000000000000021B  C5 F9 DB C2        vpand       xmm0, xmm0, xmm2
000000000000021F  C5 F1 EB C0        vpor        xmm0, xmm1, xmm0
0000000000000223  C5 F8 50 C0        vmovmskps   rax, xmm0
0000000000000227  85 C0              test        eax, eax
0000000000000229  0F 95 C0           setne       al
000000000000022C  C3                 ret






SIMD performance

[Chart; x-axis: array size; series: Naive, Whole, SIMD]
What about binary search?

[Chart: cycles (0 to 500) vs. array size; series: Naive, SIMD, BinarySearch]



Small Search Guidelines 


• Naive code is reasonable for small counts 

• Because OOO runs Excel faster! 

• Prefer SIMD for predictable <100 elem searches 

• Binary search competitive >100 32-bit elements

• Scrutinize older micro-optimization closely 

• Make sure the compiler is playing for your team 

• Auto-vectorization generates terrible code sometimes 
Measuring latency in cycles

• Need a way to synchronize OOO machinery

• Retire all pending instructions, prevent scheduling

• CPUID fits the bill - has fixed cost

• Use RDTSC to read time stamp counter

• RDTSCP doesn't actually retire all pending instructions, can't use it. (See AMD errata.)

• Assumes the platform's TSC counts core clock cycles (check yours)
Measuring, code

• Use CPUID/RDTSC/CPUID sandwich

• Subtract fixed cost later during reporting

        xor     eax, eax
        cpuid                       ; retire + prevent issues
        rdtsc                       ; read TSC into edx:eax
        shl     rdx, 32
        lea     r15, [rax + rdx]    ; combine to 64-bit quantity, save in r15
        xor     eax, eax
        cpuid                       ; retire + prevent issues
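The same CPUID/RDTSC/CPUID sandwich can be wrapped for use from C. The sketch below assumes a GCC or Clang toolchain targeting x86-64 (it relies on their inline assembly syntax), and the function names are invented for the example. It is a sketch of the idea, not the harness used for the numbers in this talk.

```c
#include <stdint.h>

/* Executes CPUID leaf 0 purely for its serializing side effect.
   GCC/Clang inline assembly, x86-64 only (an assumption of this sketch). */
static inline void serialize_with_cpuid(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)eax; (void)ebx; (void)ecx; (void)edx;
}

/* Reads the time stamp counter and combines edx:eax into 64 bits. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* CPUID / RDTSC / CPUID: nothing older is still pending when the counter
   is read, and nothing younger starts before the read is done. */
uint64_t serialized_timestamp(void)
{
    serialize_with_cpuid();
    uint64_t t = read_tsc();
    serialize_with_cpuid();
    return t;
}
```

Take one timestamp before and one after the code under test, subtract, and then subtract the fixed cost of an empty measurement when reporting.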
Measuring latency

• Warm up I1 by calling the code first

• Run multiple tests to avoid interference 

• Even consoles have interrupts, OS shenanigans 

• Clear cache by using _mm_clflush() in a loop 
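A cache-clearing loop for the cold-cache runs could be sketched like this. The helper name is made up, and the 64-byte line size is an assumption that should be checked against the target platform.

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence */
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes; verify for your platform. */
enum { CACHE_LINE_BYTES = 64 };

/* Evicts every cache line that overlaps [data, data + size). */
void flush_range(const void *data, size_t size)
{
    uintptr_t start = (uintptr_t)data & ~(uintptr_t)(CACHE_LINE_BYTES - 1);
    uintptr_t end = (uintptr_t)data + size;

    for (uintptr_t p = start; p < end; p += CACHE_LINE_BYTES)
        _mm_clflush((const void *)p);

    /* Wait for the flushes to complete before timing anything. */
    _mm_mfence();
}
```

Call it on every buffer the test touches right before the timed region; the data itself is left unchanged.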


OOO Intuition

• Jaguar OOO is a loop unroller

• Up to 64-or-so instructions

• Jaguar OOO is a prefetcher

• And even fetches loads speculatively down branches you haven't taken yet!

• Jaguar OOO doesn't solve memory latency

• But overlapping L2 misses is a big deal 
Anonymous programmers corrected

• "Memory access isn't a problem with OOO"

• It still is. Overlap your loads!

• "Branches aren't a problem with OOO"

• They still are. Avoid trees & speculative cache pollution.

• "SIMD isn't necessary on OOO"

• It's the only way to get the full cache bandwidth!


Takeaways for Jaguar Perf 

• Unrolling, prefetching are of limited use 

• Measure carefully, consider maintenance aspects 

• Use arrays 

• Really, really, really consider using an array 

• Linked lists turn OOO into in-order disaster

• Use SIMD 

• See my talk from last year for more meat 


Resources 

• Software Optimization Guide for AMD Family 
16h Processors (AMD, pdf) 

• http://www.agner.org/optimize/#manuals

• "JAGUAR" AMD’s Next Generation Low Power 
x86 Core, Jeff Rupley, AMD Fellow 


Thank you! - Q & A 

email: afredriksson@insomniacgames.com 
twitter: @deplinenoise 

Special thanks to: 

Mark Cerny 
Fabian Giesen 
Jonathan Adamczewski 

Mike Acton & the rest of the Insomniac Core team 
Bonus: Hot D1, Cold L2

• Jaguar has an inclusive cache hierarchy 

• All D1/I1 lines must also be in L2 

• L2 hears about all D1 misses 

• L2 hears nothing about D1 hits 



So what if you have a routine that does nothing 
but HIT D1? 

Bonus: Hot D1, Cold L2

• Net effect: White hot D1 data can be evicted 

• L2 assoc = 16 lines, they WILL be reused 

• Our data looks old in the LRU order and the L2 hasn't 
heard about it for a while.. 

• End game: Inner loop has to L2 miss all the way to main 
memory randomly to get back its really hot data 

• In practice not a big deal, but can definitely show up 


